Biological Pattern Discovery with R: Machine Learning Approaches (Zheng Rong Yang)

[…] data are in the format of peptides. A peptide is composed of non-numerical attributes such as amino acids or nucleic acids. A proper conversion of these non-numerical data, called the encoding process, must be performed because not all machine learning algorithms can accept non-numerical inputs. In addition to the commonly used binary encoding, this chapter has introduced an alternative encoding approach which has biological significance. This novel encoding method is called the bio-basis function. This alternative does not treat the residues of a peptide as independent variables. Just like the move from the 1980s’ edit distance-based sequence homology alignment to the 1990s’ mutation-based homology alignment, the introduction of the bio-basis function has […] the binary encoding method. When integrated with different discriminant analysis algorithms, the bio-basis function can be efficiently used for better protease cleavage pattern discovery, as shown in several […] this chapter. Importantly, they show different outstanding features in protease cleavage pattern discovery. For instance, the mixture bio-basis function benefits from the use of multiple mutation matrices, leading to improved performance in peptide cleavage pattern discovery. Through the integration of the bio-basis function with the random forest algorithm, the cleaved peptides which are closest to the probable […] can be discovered. This provides important information for efficient inhibitor design. Besides, this cutting-edge approach can be well used for similar biological/medical pattern discovery